STATS 32 Session 3: Data Visualization with ggplot

Damian Pavlyshyn

April 21

http://web.stanford.edu/class/stats32/lectures/

Recap of session 2: Data tables

A standard-form data table is a matrix of values where each column corresponds to a variable and each row corresponds to an observation.

In this lecture we will see how to use a data table to gain insight into the variables (corresponding to columns) and how they relate to each other.

This is essentially a definition of data presentation.

Data presentation

We start with a big and unwieldy table of numbers. How do we extract useful information from it?

Numerical summaries

Try this out on some vectors and dataframes
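The list of summary functions from the original slide is not reproduced here. As a minimal sketch, the following uses a few standard base R summaries on the built-in mtcars data frame (an assumption: any numeric vector or data frame would do equally well).

```r
# A numeric vector: the mpg column of the built-in mtcars data frame
x <- mtcars$mpg

mean(x)      # arithmetic mean
median(x)    # middle value
sd(x)        # standard deviation
range(x)     # smallest and largest values
quantile(x)  # minimum, quartiles and maximum

# summary() works on a whole data frame, one column at a time
summary(mtcars[, c("mpg", "hp", "wt")])

# table() counts how often each value occurs, which suits discrete variables
cyl_counts <- table(mtcars$cyl)
cyl_counts
```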

Plots and visualizations

Each row specifies a graphical element of a plot

The layered grammar of graphics

str(efficiency)
## Classes 'tbl_df', 'tbl' and 'data.frame':    32 obs. of  7 variables:
##  $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cylinders   : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ weight      : num  2620 2875 2320 3215 3440 ...
##  $ horsepower  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ engine      : Factor w/ 2 levels "V-shaped","straight": 1 1 2 2 1 2 1 2 2 2 ...
##  $ transmission: Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gears       : num  4 4 4 3 3 3 3 4 4 4 ...

Ingredients of a plot:

ggplot(data = efficiency, aes(x = weight, y = mpg, color = transmission)) +
    geom_point(size = 3) +
    ggtitle("Fuel efficiency vs vehicle weight")

ggplot(data = efficiency, aes(x = cylinders, fill = transmission)) +
    geom_bar() +
    coord_flip()

ggplot(data = efficiency, aes(x = horsepower)) +
    geom_histogram(bins = 10) +
    geom_freqpoly(aes(color = engine), bins = 10)

Two classes of variables in statistics

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Notice that the number of cylinders (cyl) is stored as a number, not a factor, so it is treated as a continuous variable.

ggplot(data = mtcars, aes(x = wt, y = mpg, color = cyl)) +
    geom_point(size = 3)

But the only cylinder numbers are 4, 6 and 8, so we probably want to treat them as discrete. After all, the continuous color scale in the above graphic assigns colors to in-between values such as 5 cylinders, which isn’t at all useful!

ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point(size = 3)

By converting the number of cylinders to the factor type, R now knows to treat it as a discrete variable and the resulting plot makes much more sense!
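To see what the conversion does, here is a small sketch that inspects the column before and after calling factor(). The variable name cyl_factor is introduced only for illustration.

```r
# cyl is stored as a plain number in mtcars
class(mtcars$cyl)

# factor() turns the distinct values into discrete levels
cyl_factor <- factor(mtcars$cyl)
class(cyl_factor)
levels(cyl_factor)

# The number of cars at each level
table(cyl_factor)
```

With only three levels, ggplot can give each level its own color instead of a continuous gradient.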

Barplots: counts for a categorical variable

What is the distribution of cylinders in my dataset?

ggplot(data = efficiency, aes(x = cylinders)) +
    geom_bar() +
    ggtitle("Count by cylinders") +
    xlab("No. of cylinders")

Histograms: counts for a continuous variable

What is the distribution of miles per gallon in my dataset?

ggplot(data = efficiency, aes(x = mpg)) + 
    geom_histogram() +
    ggtitle("Histogram of miles per gallon")

Not ideal: by default, geom_histogram uses 30 bins, which is too many for 32 observations and defeats the purpose of a histogram. We can manually specify the bin boundaries using the breaks option.

ggplot(data = efficiency, aes(x = mpg)) + 
    geom_histogram(breaks = seq(10, 35, 5)) +
    ggtitle("Histogram of miles per gallon")
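The breaks are just a vector of bin boundaries. The sketch below shows what seq(10, 35, 5) produces and counts the observations per bin with base R, using mtcars$mpg as a stand-in for the mpg column of efficiency (an assumption) and right-closed intervals, which is also cut()'s default.

```r
# Bin boundaries: 10, 15, ..., 35 gives five bins of width 5
breaks <- seq(10, 35, 5)
breaks

# Assign each observation to a bin; intervals are of the form (a, b]
bins <- cut(mtcars$mpg, breaks = breaks)

# Count the observations in each bin, as the histogram bars would show
bin_counts <- table(bins)
bin_counts
```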

Scatterplots: continuous variable vs. continuous variable

What is the relationship between mpg and weight?

ggplot(data = efficiency, aes(y = mpg, x = weight)) + 
    geom_point(size = 2) + 
    ggtitle("Miles per gallon vs. weight")

Lineplots: continuous variable vs. time variable

What is the relationship between mpg and time?

We will plot the yearly mean mpg against the year. To create the corresponding table, we use the following code, which we will explain in later lectures.

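library(dplyr)        # provides %>%, group_by() and summarize()
library(fueleconomy)  # provides the vehicles dataset
data(vehicles)
# Overwrite vehicles with one row per year holding the mean highway mpg
vehicles <- vehicles %>%
    group_by(year) %>%
    summarize(`mean highway mpg` = mean(hwy))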

head(vehicles)
## # A tibble: 6 x 2
##    year `mean highway mpg`
##   <dbl>              <dbl>
## 1  1984               19.1
## 2  1985               23.0
## 3  1986               22.7
## 4  1987               22.4
## 5  1988               22.7
## 6  1989               22.5

Now, we make our usual scatterplot:

ggplot(data = vehicles, aes(y = `mean highway mpg`, x = year)) +
    geom_point() +
    ggtitle("Mean highway mpg by year")

Hmmm, not so good… Isolated points make it hard to follow the trend from one year to the next.

Let’s replace geom_point with geom_line:

ggplot(data = vehicles, aes(y = `mean highway mpg`, x = year)) +
    geom_line() +
    ggtitle("Mean highway mpg by year")

Boxplots & violin plots: continuous variable vs. categorical variable

For each number of cylinders, what is the distribution of mpg like?

p <- ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
    ggtitle("Distribution of mpg by cylinders")

We can store parts of a plot as a variable and re-use it with different layers:
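As a minimal sketch of this re-use, the stored base plot can be combined with geom_boxplot and geom_violin in turn. The efficiency table is not built in, so mtcars with factor(cyl) stands in for it here (an assumption), and the names p_box and p_violin are introduced only for illustration.

```r
library(ggplot2)

# Base plot without any geom: data, aesthetics and title only
# (mtcars stands in for `efficiency`; cyl is converted to a factor)
p <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
    ggtitle("Distribution of mpg by cylinders") +
    xlab("No. of cylinders")

# Re-use the same base with two different layers
p_box <- p + geom_boxplot()
p_violin <- p + geom_violin()

print(p_box)
print(p_violin)
```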

Position: Arranging bar plots

p <- ggplot(data = efficiency, aes(x = cylinders, fill = engine)) +
    ggtitle("Count by cylinders") +
    xlab("No. of cylinders")

In a bar plot, we have different ways of arranging the bars:
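The arrangements from the original slide are not reproduced here. As a sketch, geom_bar accepts a position argument, and three commonly used values are shown below. mtcars stands in for efficiency (an assumption), with the vs column playing the role of engine.

```r
library(ggplot2)

# Base plot: one bar per cylinder count, filled by engine type
# (in mtcars, vs = 0 is a V-shaped engine and vs = 1 a straight engine)
p <- ggplot(data = mtcars, aes(x = factor(cyl), fill = factor(vs))) +
    ggtitle("Count by cylinders") +
    xlab("No. of cylinders") +
    labs(fill = "engine (vs)")

p_stack <- p + geom_bar(position = "stack")  # bars on top of each other (the default)
p_dodge <- p + geom_bar(position = "dodge")  # bars side by side
p_fill  <- p + geom_bar(position = "fill")   # stacked and rescaled to proportions

print(p_stack)
print(p_dodge)
print(p_fill)
```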

Position: Seeing obscured data

p <- ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
    ggtitle("mpg by cylinders")

Often, points will obscure one another and we need to move them out of the way to see what’s going on.
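One way to do this, sketched below, is geom_jitter, which adds a small random offset to each point; transparency via alpha is another option. mtcars stands in for efficiency (an assumption), and the width value is an arbitrary illustrative choice.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
    ggtitle("mpg by cylinders") +
    xlab("No. of cylinders")

# Plain points: cars with equal or similar mpg overlap in each column
p_overlap <- p + geom_point()

# Jitter horizontally only, so the mpg values themselves are not distorted
p_jitter <- p + geom_jitter(width = 0.2, height = 0)

# Semi-transparent points: overlapping points appear darker
p_alpha <- p + geom_point(alpha = 0.4, size = 3)

print(p_overlap)
print(p_jitter)
print(p_alpha)
```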

Common graphical specifications

Aesthetics

These aesthetics are shared by many different geoms and so are good to know off the top of your head.

Some geoms have special aesthetics; these are usually documented in the help file for the corresponding geom.
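The list of aesthetics from the original slide is not reproduced here. As a hedged illustration, the sketch below uses several aesthetics that geom_point understands (x, y, color, shape, size and alpha), with mtcars as a stand-in dataset. Aesthetics inside aes() are mapped to variables, while those outside are set to a fixed value.

```r
library(ggplot2)

# Mapped aesthetics (inside aes) vary with the data;
# alpha is set to a constant (outside aes) for every point
p_aes <- ggplot(data = mtcars,
                aes(x = wt, y = mpg,
                    color = factor(cyl),   # discrete color per cylinder count
                    shape = factor(am),    # point shape per transmission type
                    size = hp)) +          # point size grows with horsepower
    geom_point(alpha = 0.7)

print(p_aes)
```

Running ?geom_point shows the aesthetics that this particular geom accepts.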

Geoms

We’ve gone over many of these in the previous slides, but they’re assembled in this list for reference.

Shapes in R

Colors in R

Today’s dataset: World Bank data

(Source: flickr and World Bank)

DataBank homepage

Interface for World Development Indicators